Maxwell Mkondiwa (m.mkondiwa@cgiar.org) and Jordan Chamberlin (j.chamberlin@cgiar.org)
1 Introduction
This notebook provides the R code for the paper estimating heterogeneous treatment effects in Tanzania using non-parametric geoadditive and causal machine learning (ML) models. The analysis addresses the following questions:
library(lme4)  # provides lmList()
# lmList by year: linear baseline
baseline_ols_yearList <- lmList(yld_ ~ N_kgha + plotha + P_kgha + ncrops +
                                  hhmem + femhead + headeduc + arain_tot | year,
                                tz_lsms_panel)
coef(baseline_ols_yearList)
# lmList by year: quadratic in N and P
baseline_quad_lmList <- lmList(yld_ ~ N_kgha + I(N_kgha^2) + P_kgha + I(P_kgha^2) +
                                 ncrops + hhmem + femhead + headeduc + arain_tot | year,
                               tz_lsms_panel)
coef(baseline_quad_lmList)
## With rainfall interactions
baseline_quad_lmList_inter <- lmList(yld_ ~ N_kgha * arain_tot + I(N_kgha^2) * arain_tot +
                                       P_kgha + I(P_kgha^2) + ncrops + hhmem +
                                       femhead + headeduc | year,
                                     tz_lsms_panel)
coef(baseline_quad_lmList_inter)
# Quadratic plateau or linear plateau (easynls model 4)
library(easynls)
tz_lsms_panel.temp <- tz_lsms_panel[, c("yld_", "N_kgha")]
nls.QP <- nlsfit(tz_lsms_panel.temp, model = 4)
nls.QP$Model
[1] "y ~ (a + b * x + c * I(x^2)) * (x <= -0.5 * b/c) + (a + I(-b^2/(4 * c))) * (x > -0.5 * b/c)"
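In conventional notation, the model string above is the quadratic-plus-plateau function
\[
y =
\begin{cases}
a + b x + c x^2, & x \le -\dfrac{b}{2c},\\[4pt]
a - \dfrac{b^2}{4c}, & x > -\dfrac{b}{2c},
\end{cases}
\]
so the critical point \(x^{*} = -b/(2c)\) is where the quadratic reaches its vertex, and the response plateaus at \(y^{*} = a - b^{2}/(4c)\). These correspond to the "critical point in x" and "maximum or minimum value for y" rows in the parameter table below.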
Code
mn <- nls.QP$Parameters
mn
N_kgha
coefficient a -3.909774e-01
coefficient b 1.052253e-02
coefficient c -1.090000e-06
p-value t.test for a 2.841850e-01
p-value t.test for b 0.000000e+00
p-value t.test for c 0.000000e+00
r-squared 8.650000e-02
adjusted r-squared 8.640000e-02
AIC 8.165610e+04
BIC 8.168471e+04
maximum or minimum value for y 2.499715e+01
critical point in x 4.825482e+03
Code
summary(mn)
N_kgha
Min. : -0.39
1st Qu.: 0.00
Median : 0.09
Mean :14015.95
3rd Qu.: 1225.12
Max. :81684.71
5.1.1 Site-year specific quadratic-only response functions
The site-year-specific quadratic response function can be modeled hierarchically. At level 1, we have \[ Y_{i} = a_{i} + b_{i} N + c_{i} N^2 + \varepsilon_{i} \] At level 2, we have \[
a_{i} \sim N(a_0, \sigma_{a}^2) \\
b_{i} \sim N(b_0, \sigma_{b}^2) \\
c_{i} \sim N(c_0, \sigma_{c}^2)
\] This model can be estimated with the linear mixed-model function lmer in the R package lme4.
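The lmer call itself is not shown in this rendering; the following is a reconstruction from the printed formula (the object name lmer.Q is taken from the summary() call further down, and the construction of the squared term N2 is an assumption):

```r
library(lme4)

# Squared N term appearing in the printed formula (assumed construction)
tz_lsms_panel.temp$N2 <- tz_lsms_panel.temp$N_kgha^2

# Independent random intercepts and random N, N^2 slopes by year
lmer.Q <- lmer(yld_ ~ 1 + N_kgha + N2 +
                 (1 | year) + (0 + N_kgha | year) + (0 + N2 | year),
               data = tz_lsms_panel.temp)
lmer.Q
```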
Linear mixed model fit by REML ['lmerMod']
Formula: yld_ ~ 1 + N_kgha + N2 + (1 | year) + (0 + N_kgha | year) + (0 +
N2 | year)
Data: tz_lsms_panel.temp
REML criterion at convergence: 152692.7
Random effects:
Groups Name Std.Dev.
year (Intercept) 8.001e+01
year.1 N_kgha 2.748e+00
year.2 N2 6.472e-03
Residual 7.960e+02
Number of obs: 9426, groups: year, 5
Fixed Effects:
(Intercept) N_kgha N2
748.29010 11.24225 0.01232
fit warnings:
Some predictor variables are on very different scales: consider rescaling
optimizer (nloptwrap) convergence code: 0 (OK) ; 0 optimizer warnings; 1 lme4 warnings
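The rescaling warning arises because N2 spans a far larger range than N_kgha. A minimal sketch of one common remedy, refitting with N in units of 100 kg/ha (the variable and object names here are illustrative, not from the original analysis):

```r
library(lme4)

# Put N and N^2 on comparable scales: N in 100 kg/ha units
tz_lsms_panel.temp$N_100  <- tz_lsms_panel.temp$N_kgha / 100
tz_lsms_panel.temp$N2_100 <- tz_lsms_panel.temp$N_100^2

lmer.Q.scaled <- lmer(yld_ ~ 1 + N_100 + N2_100 +
                        (1 | year) + (0 + N_100 | year) + (0 + N2_100 | year),
                      data = tz_lsms_panel.temp)
```

The fixed-effect estimates rescale accordingly (the N_100 coefficient is 100 times the N_kgha coefficient), while the fit itself should be essentially unchanged.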
Code
summary(lmer.Q)
Linear mixed model fit by REML ['lmerMod']
Formula: yld_ ~ 1 + N_kgha + N2 + (1 | year) + (0 + N_kgha | year) + (0 +
N2 | year)
Data: tz_lsms_panel.temp
REML criterion at convergence: 152692.7
Scaled residuals:
Min 1Q Median 3Q Max
-2.3333 -0.6213 -0.3108 0.2772 6.6397
Random effects:
Groups Name Variance Std.Dev.
year (Intercept) 6.402e+03 8.001e+01
year.1 N_kgha 7.554e+00 2.748e+00
year.2 N2 4.188e-05 6.472e-03
Residual 6.337e+05 7.960e+02
Number of obs: 9426, groups: year, 5
Fixed effects:
Estimate Std. Error t value
(Intercept) 748.29010 36.88270 20.288
N_kgha 11.24225 1.75175 6.418
N2 0.01232 0.01525 0.808
Correlation of Fixed Effects:
(Intr) N_kgha
N_kgha -0.051
N2 0.046 -0.657
fit warnings:
Some predictor variables are on very different scales: consider rescaling
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
Alternatively, the model can be estimated with the non-linear mixed-model function nlme from the R package nlme:
Code
library(nlme)
nlme.Q <- nlme(yld_ ~ (a + b * N_kgha + c * (N_kgha^2)),
               data = tz_lsms_panel,
               fixed = a + b + c ~ 1,
               random = a + b + c ~ 1,
               groups = ~ year,
               start = c(800, 10, -0.001))
nlme.Q
Nonlinear mixed-effects model fit by maximum likelihood
Model: yld_ ~ (a + b * N_kgha + c * (N_kgha^2))
Data: tz_lsms_panel
Log-likelihood: -76345.33
Fixed: a + b + c ~ 1
a b c
747.82829275 11.16211183 0.01323599
Random effects:
Formula: list(a ~ 1, b ~ 1, c ~ 1)
Level: year
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
a 57.45976820 a b
b 1.64304646 -0.022
c 0.02343941 0.946 -0.341
Residual 795.98608057
Number of Observations: 9426
Number of Groups: 5
5.1.2 Site-year specific Quadratic-plus-plateau response functions
The site-year specific response function can be modeled using a hierarchical quadratic-plus-plateau model. At level 1, we have \[ Y_{i} = \min(a_{i} + b_{i}*N + c_{i}*N^2, Y_{max}) + \varepsilon_{i} \] At level 2, we have \[
a_{i} \sim N(a_0, \sigma_{a}^2) \\
b_{i} \sim N(b_0, \sigma_{b}^2) \\
c_{i} \sim N(c_0, \sigma_{c}^2) \\
Y_{max} = a_{i} - b_{i}^2/(4*c_i)
\]
In principle, the R function nlme can estimate this model. The chunk below computes starting values from a pooled quadratic fit; the exploratory nls/nlme attempts themselves are left commented out.
Code
# Starting values from a pooled quadratic fit
mq <- lm(yld_ ~ N_kgha + I(N_kgha^2), data = tz_lsms_panel)
a0 <- coef(mq)[[1]]
b0 <- coef(mq)[[2]]
c0 <- coef(mq)[[3]]
clx0 <- -0.5 * b0 / c0

# Exploratory nls/nlme quadratic-plus-plateau fits (not run)
# fx.QP <- function(N, a, b, c) {
#   y <- (a + b * N + c * I(N^2)) * (N <= -0.5 * b/c) +
#        (a + I(-b^2/(4 * c))) * (N > -0.5 * b/c)
#   return(y)
# }
#
# nls.QP <- nls(Y ~ fx.QP(N, a, b, c),
#               start = list(a = a0, b = b0, c = c0), data = dat.Puntel.CC.mean,
#               control = nls.control(maxiter = 6000))
#
# quadplat <- function(x, a, b, clx) {
#   ifelse(x < clx, a + b * x + (-0.5 * b/clx) * x * x,
#          a + b * clx + (-0.5 * b/clx) * clx * clx)
# }
#
# nls.QP <- nls(Y ~ quadplat(N, a, b, clx),
#               start = list(a = a0, b = b0, clx = clx0), data = dat.Puntel.CC.mean,
#               control = nls.control(maxiter = 6000))
#
# nlme.QP <- nlme(Y ~ fx.QP(N, a, b, c),
#                 data = dat.Puntel.CC.mean,
#                 fixed = a + b + c ~ 1,
#                 random = a + b + c ~ 1,
#                 groups = ~ year,
#                 start = c(a0, b0, c0))
# nlme.QP
#
# tz_lsms_panel.nlme <- nlme(yld_ ~ (a + b * N_kgha + c * I(N_kgha^2)) * (N_kgha <= (-0.5 * b/c)) +
#                              (a - I(b^2/(4 * c))) * (N_kgha > (-0.5 * b/c)),
#                            data = tz_lsms_panel,
#                            fixed = a + b + c ~ 1,
#                            random = a + b + c ~ 1,
#                            groups = ~ year,
#                            start = c(a0, b0, c0))
# tz_lsms_panel.nlme
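As a sanity check that the piecewise quadratic-plus-plateau specification is estimable at all, here is a self-contained sketch fitting it with nls on simulated data (all names and parameter values are illustrative, not from the paper's dataset):

```r
set.seed(42)
# Simulate yields from a known quadratic-plus-plateau response
N <- runif(500, 0, 300)
a_true <- 800; b_true <- 12; c_true <- -0.03   # plateau begins at N = 200
clx_true <- -0.5 * b_true / c_true
mu <- ifelse(N <= clx_true,
             a_true + b_true * N + c_true * N^2,
             a_true - b_true^2 / (4 * c_true))
yld <- mu + rnorm(500, sd = 150)

# Quadratic-plus-plateau mean function
quadplat <- function(N, a, b, c) {
  clx <- -0.5 * b / c
  ifelse(N <= clx, a + b * N + c * N^2, a - b^2 / (4 * c))
}

nls.sim <- nls(yld ~ quadplat(N, a, b, c),
               start = list(a = 700, b = 10, c = -0.02),
               control = nls.control(maxiter = 500))
coef(nls.sim)
```

The same mean function can in principle be supplied to nlme with year-level random effects on a, b, and c, though convergence is often fragile and depends heavily on good starting values.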
Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with one-sided heteroskedasticity-robust (HC3) SEs:
Estimate Std. Error t value Pr(>t)
mean.forest.prediction 0.99724 0.10227 9.7509 < 2.2e-16 ***
differential.forest.prediction 1.25318 0.26830 4.6708 1.521e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with one-sided heteroskedasticity-robust (HC3) SEs:
Estimate Std. Error t value Pr(>t)
mean.forest.prediction 0.99353 0.06987 14.2196 < 2.2e-16 ***
differential.forest.prediction 1.23712 0.26364 4.6925 1.369e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
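Calibration tables of this form are produced by grf's test_calibration(). A self-contained sketch on simulated data (the forest objects used in the paper's analysis are not shown here):

```r
library(grf)
set.seed(1)

# Simulate a simple heterogeneous-treatment-effects problem
n <- 2000; p <- 5
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- pmax(X[, 1], 0) * W + X[, 2] + rnorm(n)

cf <- causal_forest(X, Y, W)
test_calibration(cf)
# mean.forest.prediction near 1: the average prediction is well calibrated;
# differential.forest.prediction near 1 and significant: the forest has
# detected genuine treatment-effect heterogeneity.
```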
Code
library(ggridges)
library(dplyr)
library(ggplot2)
tau.multi_fert.forest_continuous <- predict(multi_fert.forest_continuous,
                                            target.sample = "all",
                                            estimate.variance = TRUE)
tau.multi_fert.forest_continuous <- as.data.frame(tau.multi_fert.forest_continuous)
tau.multi_fert.forest_X <- data.frame(tz_lsms_panel_estim_fert, tau.multi_fert.forest_continuous)
NUE_Density <- ggplot(tau.multi_fert.forest_X,
                      aes(x = predictions, y = "", fill = factor(stat(quantile)))) +
  stat_density_ridges(geom = "density_ridges_gradient", calc_ecdf = TRUE,
                      quantiles = 4, quantile_lines = TRUE) +
  scale_y_discrete(expand = c(0.01, 0)) +
  scale_fill_viridis_d(name = "Quartiles") +
  expand_limits(y = 1) +
  theme_bw(base_size = 16) +
  labs(x = "N use effect", y = "Density")
NUE_Density
7 Exploring heterogeneity in Conditional N Use Efficiencies from CRF
7.1 Causal RF
Code
library(ggplot2)
NperHa_CATE_N_plot <- ggplot(subset(tau.multi_fert.forest_X, tau.multi_fert.forest_X$N_kgha > 0),
                             aes(N_kgha, predictions)) +
  geom_smooth(method = "loess", formula = y ~ x, col = "darkblue") +
  labs(x = "Applied N per ha", y = "N use efficiency")
previous_theme <- theme_set(theme_bw(base_size = 16))
NperHa_CATE_N_plot
Code
Plotsize_CATE_N_plot <- ggplot(tau.multi_fert.forest_X, aes(plotha, predictions)) +
  geom_smooth(method = "loess", formula = y ~ x, col = "darkblue") +
  labs(x = "Plot size (ha)", y = "N use efficiency")
previous_theme <- theme_set(theme_bw(base_size = 16))
Plotsize_CATE_N_plot
Code
P_CATE_N_plot <- ggplot(tau.multi_fert.forest_X, aes(P_kgha, predictions)) +
  geom_smooth(method = "loess", formula = y ~ x, col = "darkblue") +
  labs(x = "Applied P per ha", y = "N use efficiency")
previous_theme <- theme_set(theme_bw(base_size = 16))
P_CATE_N_plot
Code
Hhmem_CATE_N_plot <- ggplot(tau.multi_fert.forest_X, aes(hhmem, predictions)) +
  geom_smooth(method = "loess", formula = y ~ x, col = "darkblue") +
  labs(x = "Household members", y = "N use efficiency")
previous_theme <- theme_set(theme_bw(base_size = 16))
Hhmem_CATE_N_plot
Code
# By year
NperHa_CATE_N_plot_yr <- ggplot(subset(tau.multi_fert.forest_X, tau.multi_fert.forest_X$N_kgha > 0),
                                aes(N_kgha, predictions)) +
  geom_smooth(method = "loess", formula = y ~ x, col = "darkblue") +
  lims(x = c(0, 100)) +
  labs(x = "Applied N per ha", y = "N use efficiency") +
  theme_bw(base_size = 16) +
  facet_wrap(~year)
NperHa_CATE_N_plot_yr
Code
# By soil organic carbon
library(ggplot2)
soc_CATE_N_plot <- ggplot(subset(tau.multi_fert.forest_X, tau.multi_fert.forest_X$soc_5_15cm > 0),
                          aes(soc_5_15cm, predictions)) +
  geom_smooth(method = "loess", formula = y ~ x, col = "darkblue") +
  labs(x = "Soil organic carbon (%)", y = "N use efficiency")
previous_theme <- theme_set(theme_bw(base_size = 16))
soc_CATE_N_plot
Code
# By soil sand
library(ggplot2)
sand_CATE_N_plot <- ggplot(subset(tau.multi_fert.forest_X, tau.multi_fert.forest_X$sand_0_5cm > 0),
                           aes(sand_0_5cm, predictions)) +
  geom_smooth(method = "loess", formula = y ~ x, col = "darkblue") +
  labs(x = "Sand (%)", y = "N use efficiency")
previous_theme <- theme_set(theme_bw(base_size = 16))
sand_CATE_N_plot
Code
# By soil electrical conductivity
library(ggplot2)
soil_elec_cond_CATE_N_plot <- ggplot(subset(tau.multi_fert.forest_X, tau.multi_fert.forest_X$ECN_5_15cm > 0),
                                     aes(ECN_5_15cm, predictions)) +
  geom_smooth(method = "loess", formula = y ~ x, col = "darkblue") +
  labs(x = "Soil electrical conductivity", y = "N use efficiency")
previous_theme <- theme_set(theme_bw(base_size = 16))
soil_elec_cond_CATE_N_plot
Code
# By population density
library(ggplot2)
pop_density_CATE_N_plot <- ggplot(subset(tau.multi_fert.forest_X, tau.multi_fert.forest_X$population_density > 0),
                                  aes(population_density, predictions)) +
  geom_smooth(method = "loess", formula = y ~ x, col = "darkblue") +
  labs(x = "Population density", y = "N use efficiency")
previous_theme <- theme_set(theme_bw(base_size = 16))
pop_density_CATE_N_plot
Code
# By elevation
library(ggplot2)
elev_CATE_N_plot <- ggplot(subset(tau.multi_fert.forest_X, tau.multi_fert.forest_X$wc2.1_30s_elev > 0),
                           aes(wc2.1_30s_elev, predictions)) +
  geom_smooth(method = "loess", formula = y ~ x, col = "darkblue") +
  labs(x = "Elevation (masl)", y = "N use efficiency")
previous_theme <- theme_set(theme_bw(base_size = 16))
elev_CATE_N_plot
8 Are N use efficiencies falling over time?
8.1 Causal RF
Code
# Map survey waves to calendar years
tau.multi_fert.forest_X$year_det[tau.multi_fert.forest_X$year == 1] <- 2008
tau.multi_fert.forest_X$year_det[tau.multi_fert.forest_X$year == 2] <- 2010
tau.multi_fert.forest_X$year_det[tau.multi_fert.forest_X$year == 3] <- 2012
tau.multi_fert.forest_X$year_det[tau.multi_fert.forest_X$year == 4] <- 2014
tau.multi_fert.forest_X$year_det[tau.multi_fert.forest_X$year == 5] <- 2020
tau.multi_fert.forest_X$year_det <- as.numeric(tau.multi_fert.forest_X$year_det)
Year_CATE_N_plot <- ggplot(tau.multi_fert.forest_X, aes(year_det, predictions)) +
  geom_smooth(method = "loess", formula = y ~ x, col = "darkblue") +
  labs(x = "Year", y = "N use efficiency") +
  scale_x_continuous(breaks = c(2008, 2010, 2012, 2014, 2020))
previous_theme <- theme_set(theme_bw())
Year_CATE_N_plot
library(ggridges)
library(dplyr)
library(ggplot2)
tau.multi_fert.forest_dummy <- predict(multi_fert.forest_binary,
                                       target.sample = "all",
                                       estimate.variance = TRUE)
tau.multi_fert.forest_dummy <- as.data.frame(tau.multi_fert.forest_dummy)
tau.multi_fert.forest_X_dummy <- data.frame(tz_lsms_panel_estim_fert, tau.multi_fert.forest_dummy)
ggplot(tau.multi_fert.forest_X_dummy,
       aes(x = predictions, y = "", fill = factor(stat(quantile)))) +
  stat_density_ridges(geom = "density_ridges_gradient", calc_ecdf = TRUE,
                      quantiles = 4, quantile_lines = TRUE) +
  scale_y_discrete(expand = c(0.01, 0)) +
  scale_fill_viridis_d(name = "Quartiles") +
  expand_limits(y = 1) +
  theme_bw(base_size = 16) +
  labs(x = "Inorganic fert use effect", y = "Density")
Code
library(sp)  # SpatialPointsDataFrame() and CRS() come from sp, not sf
library(sf)
tau.multi_fert.forest_X_dummy_sp <- SpatialPointsDataFrame(
  cbind(tau.multi_fert.forest_X_dummy$lon_modified, tau.multi_fert.forest_X_dummy$lat_modified),
  data = tau.multi_fert.forest_X_dummy,
  proj4string = CRS("+proj=longlat +datum=WGS84"))
library(tmap)
tmap_mode("plot")
Nuse_effect_map <- tm_shape(tau.multi_fert.forest_X_dummy_sp) +
  tm_dots(col = "predictions", title = "Effect of N use on yield (kg/ha)", style = "quantile") +
  tm_layout(legend.outside = TRUE)
Nuse_effect_map
Code
tmap_save(Nuse_effect_map, "figures/Nuse_effect_map.png", width = 600, height = 600, asp = 0)

# N use efficiency map
library(sp)
library(sf)
tau.multi_fert.forest_X_sp <- SpatialPointsDataFrame(
  cbind(tau.multi_fert.forest_X$lon_modified, tau.multi_fert.forest_X$lat_modified),
  data = tau.multi_fert.forest_X,
  proj4string = CRS("+proj=longlat +datum=WGS84"))
library(tmap)
tmap_mode("plot")
Nuse_efficiency_map <- tm_shape(tau.multi_fert.forest_X_sp) +
  tm_dots(col = "predictions", title = "NUE (kg maize/kg N)", style = "quantile") +
  tm_layout(legend.outside = TRUE)
Nuse_efficiency_map
Code
tmap_save(Nuse_efficiency_map, "figures/Nuse_efficiency_map.png", width = 600, height = 600, asp = 0)

# N use map
library(sp)
library(sf)
tau.multi_fert.forest_X_sp_small <- subset(tau.multi_fert.forest_X_sp,
                                           tau.multi_fert.forest_X_sp$N_kgha > 0)
library(tmap)
tmap_mode("plot")
Nuse_map <- tm_shape(tau.multi_fert.forest_X_sp_small) +
  tm_dots(col = "N_kgha", title = "N applied (kg/ha)", style = "quantile") +
  tm_layout(legend.outside = TRUE)
Nuse_map
Code
tmap_save(Nuse_map, "figures/Nuse_map.png", width =600, height =600, asp =0)
TZ_sf <- st_as_sf(TZ)
TZ_sp <- as_Spatial(TZ_sf)
tau.multi_fert.forest_X_sp$region_name <- over(tau.multi_fert.forest_X_sp, TZ_sp[, "NAME_1"])

# Summary by region and by year
library(modelsummary)
tau.multi_fert.forest_X_sp_dt <- tau.multi_fert.forest_X_sp@data
tau.multi_fert.forest_X_sp_dt$region_name_final <- tau.multi_fert.forest_X_sp_dt$region_name$NAME_1
NUE_cont <- datasummary(Heading("region_name") * region_name_final ~
                          Heading("N_obs") * N + Heading("percent") * Percent() +
                          predictions * (Mean + SD),
                        data = tau.multi_fert.forest_X_sp_dt, output = "data.frame")
NUE_cont$N_obs <- as.numeric(NUE_cont$N_obs)
NUE_cont$Mean <- as.numeric(NUE_cont$Mean)
NUE_cont$SD <- as.numeric(NUE_cont$SD)
library(reactable)
library(htmltools)
library(fontawesome)
htmltools::browsable(
  tagList(
    tags$button(
      tagList(fontawesome::fa("download"), "Download as CSV"),
      onclick = "Reactable.downloadDataCSV('NUE_cont', 'NUE_cont.csv')"
    ),
    reactable(
      NUE_cont,
      searchable = TRUE,
      defaultPageSize = 38,
      elementId = "NUE_cont"
    )
  )
)
Code
# Drop regions with 50 or fewer observations
NUE_cont <- subset(NUE_cont, NUE_cont$N_obs > 50)
NUE_cont_sf <- merge(TZ_sf, NUE_cont, by.x = "NAME_1", by.y = "region_name")

# Map regional mean NUE
library(tmap)
tmap_mode("plot")
tm_shape(NUE_cont_sf) +
  tm_polygons(col = "Mean", title = "Average NUE", style = "quantile") +
  tm_layout(legend.outside = TRUE)
pred <- SpatialPoints(elevationglobal_geodata_aoi)@coords
pred <- as.data.frame(pred)
names(pred)[1:2] <- c("lon", "lat")
library(tidyr)  # provides drop_na()
pred <- pred %>% drop_na()
library(sp)
pred_sp <- SpatialPointsDataFrame(cbind(pred$lon, pred$lat), data = pred,
                                  proj4string = CRS("+proj=longlat +datum=WGS84"))
yield_gain_hat <- predict(b, newdata = pred)
yield_gain_hat <- as.data.frame(yield_gain_hat)
pred_yield_gain_hat <- cbind(pred, yield_gain_hat)
pred_yield_gain_hat$sigma <- NULL
library(terra)
yield_gain_ras <- rast(pred_yield_gain_hat, type = "xyz")
plot(yield_gain_ras)
Code
library(raster)
yield_gain_ras2 <- raster(yield_gain_ras)
library(sf)
TZ_sf_dis_sp <- as_Spatial(TZ_sf_dis)
yield_gain_ras2_2008 <- mask(yield_gain_ras2, TZ_sf_dis_sp)
plot(yield_gain_ras2_2008, main = "Yield gains to N")